fix: EAGLE mix_hidden_states in-place op crash (#1088) #1104

javierdejesusda wants to merge 2 commits into NVIDIA:main from
Conversation
📝 Walkthrough

A fix prevents autograd crashes when EAGLE's mix hidden states mode is enabled by cloning a tensor before in-place indexing. A regression test validates that gradients flow correctly through the EAGLE module during training with mix hidden states enabled.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 5 passed
🧹 Nitpick comments (1)
tests/unit/torch/speculative/plugins/test_hf_speculative.py (1)
79-80: Consider making the test deterministic. A fixed seed before random input generation can make CI failures easier to reproduce.
♻️ Optional tweak
```diff
 def test_eagle_mix_hidden_states_backward(eagle_config, eagle_ttt_steps):
@@
-    input_ids = torch.randint(0, model.config.vocab_size, (2, 16))
+    torch.manual_seed(0)
+    input_ids = torch.randint(0, model.config.vocab_size, (2, 16))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/speculative/plugins/test_hf_speculative.py` around lines 79-80: The test uses nondeterministic inputs via torch.randint for input_ids/labels; make it deterministic by setting a fixed RNG seed before generating those tensors (e.g., call torch.manual_seed with a constant value, and if GPU tensors may be used also torch.cuda.manual_seed_all) so that input_ids and labels in test_hf_speculative.py are reproducible across CI runs; place the seed call immediately before the torch.randint line that creates input_ids to ensure determinism.
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4d359ff0-2b8f-4ce2-baa9-bc9491e18455
📒 Files selected for processing (2)
modelopt/torch/speculative/plugins/transformers.py
tests/unit/torch/speculative/plugins/test_hf_speculative.py
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1104      +/-   ##
==========================================
+ Coverage   72.74%   75.58%    +2.84%
==========================================
  Files         459      459
  Lines       48611    48612        +1
==========================================
+ Hits        35361    36745     +1384
+ Misses      13250    11867     -1383
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Clone eagle_input_hiddens before indexed assignment to avoid in-place modification of a tensor in the autograd graph, which causes RuntimeError during backward pass. Mirrors the existing fix in the Megatron backend (megatron_eagle.py:1201-1202). Add regression test parametrized over eagle_ttt_steps [1, 2]. Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com>
d963bfc to 9d8f32f
/ok to test 9d8f32f
Type of change
Description
Fixes #1088 — `RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: IndexPutBackward0` when training with `eagle_mix_hidden_states=True`.

Root cause: In `HFEagleModel._eagle_training_forward`, the indexed assignment at lines 991–994 modifies `eagle_input_hiddens` in-place while it is still part of the autograd computation graph.

Fix: Clone the tensor before the in-place assignment. This is the same pattern already used in the Megatron backend at `megatron_eagle.py:1201-1202`; the HF backend was missing this clone.
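The failure mode and the clone fix can be reproduced outside the model in a few lines of plain PyTorch (a minimal sketch — the tensor names are illustrative, not the actual `HFEagleModel` code):

```python
import torch

x = torch.randn(2, 4, requires_grad=True)

# Broken pattern: `**` saves its input for the backward pass, then the
# indexed assignment mutates that saved tensor in place (the IndexPut
# that autograd reports in the error message).
h = x * 1.0        # non-leaf tensor inside the autograd graph
y = h ** 2         # backward of ** needs the original h
h[:, 0] = 0.0      # in-place write bumps h's version counter
try:
    y.sum().backward()
except RuntimeError as e:
    print("broken:", e)  # "... modified by an inplace operation ..."

# Fixed pattern: clone before the indexed assignment, as this PR does
# for eagle_input_hiddens.
h = x * 1.0
y = h ** 2
h = h.clone()      # the write now targets a fresh tensor
h[:, 0] = 0.0      # the tensor saved by ** is untouched
(y + h).sum().backward()
print("fixed: backward succeeded")
```

The clone costs one extra copy of the tensor but keeps the saved activation intact, which is why both backends use the same pattern.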
Usage
Testing
Added `test_eagle_mix_hidden_states_backward`, parametrized over `eagle_ttt_steps` [1, 2], that:
- runs a training forward pass with `eagle_mix_hidden_states=True`
- backpropagates and checks that gradients flow through the `eagle_module`

Checklist
Summary by CodeRabbit

Bug Fixes
- Prevented autograd crashes during training when EAGLE's mix hidden states mode is enabled.

Tests
- Added a regression test validating that gradients flow correctly through the EAGLE module with mix hidden states enabled.
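The shape of the regression test described above can be sketched with a toy module (hedged: `ToyEagle` is a stand-in for the real `eagle_module`, not modelopt code — the point is the forward-with-indexed-mixing, backward, then grad-presence check):

```python
import torch
import torch.nn as nn

class ToyEagle(nn.Module):
    """Stand-in for eagle_module: mixes hidden states with embeddings."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hiddens: torch.Tensor, embeds: torch.Tensor) -> torch.Tensor:
        mixed = hiddens.clone()    # clone before the indexed write (the fix)
        mixed[:, 0] = embeds[:, 0]
        return self.proj(mixed)

torch.manual_seed(0)               # deterministic, per the review nitpick
model = ToyEagle()
hiddens = torch.randn(2, 16, 8, requires_grad=True)
embeds = torch.randn(2, 16, 8)

loss = model(hiddens, embeds).pow(2).mean()
loss.backward()                    # must not raise the IndexPutBackward0 error
assert model.proj.weight.grad is not None  # gradients reached the module
assert hiddens.grad is not None            # and the input hidden states
```

Without the `clone()` call, writing into a tensor that an earlier op saved for backward would trigger the same `RuntimeError` this PR fixes, which is exactly what the regression test guards against.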